
feat: ConvoMem sampled adapter with range-probe selective fetch #21

Merged: groksrc merged 1 commit into main from feat/convomem-sampled on Jun 12, 2026

@groksrc groksrc commented Jun 12, 2026

Member

Summary

Adds the last Phase-1 benchmark: ConvoMem (Salesforce, Apache-2.0, ~75K QA pairs), as a documented stratified sample. Pre-mixed test cases map 1:1 onto grouped runner mode.

Design

  • Selective fetch: the dataset is multi-GB (one size-300 batch is ~850MB). Batch files are ordered by context size within each category dir, so an HTTP Range tail-probe (last 4KB) reads every file's final contextSize without downloading the file. Only files matching the requested sizes are fetched. index.json records all probe results, including files not downloaded, so the selection is auditable. Probes are throttled and retried, because the HF CDN resets rapid bursts (hit and fixed during live testing).
  • Documented sampling: stratified by (category, contextSize), fixed seed, sampling.json records seed + per-stratum population/sample counts. A published number states exactly which slice it covers.
  • Anti-leakage: containsEvidence/model_name scrubbed, conversation ids remapped to neutral positional ids (covered by test).
  • Ground truth maps evidence conversation ids through the remap; abstention evidence referencing absent conversations yields empty ground truth by design.
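
For readers unfamiliar with the tail-probe idea, here is a minimal sketch of how such a probe could look. It is not the code in datasets/convomem.py: the function names, the regex, and the assumption that contextSize appears as a plain JSON integer field near the end of each batch file are all illustrative.

```python
import re
import urllib.request

TAIL_BYTES = 4096  # the PR probes the last 4KB of each batch file

# Assumption: contextSize is serialized as a JSON integer field in the tail.
_CONTEXT_SIZE_RE = re.compile(rb'"contextSize"\s*:\s*(\d+)')


def last_context_size(tail: bytes):
    """Return the last contextSize found in a file's tail bytes, or None."""
    matches = _CONTEXT_SIZE_RE.findall(tail)
    return int(matches[-1]) if matches else None


def probe_tail(url: str, tail_bytes: int = TAIL_BYTES, timeout: float = 30.0):
    """Read only the last `tail_bytes` of `url` using a suffix Range request."""
    request = urllib.request.Request(url, headers={"Range": f"bytes=-{tail_bytes}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # 206 Partial Content means the server honoured the Range header;
        # anything else would mean reading the whole (possibly ~850MB) body.
        if response.status != 206:
            raise RuntimeError(f"range request not honoured: HTTP {response.status}")
        return last_context_size(response.read())
```

Because the files are ordered by context size, the final contextSize in a file is enough to decide whether that file can contain cases of a requested size. Throttling and retries would wrap `probe_tail` and are omitted here.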

Verification

  • 7 new tests: ground-truth mapping, leakage scrub, seed determinism (same seed → identical sample; different seed → different sample), stratification manifest, context-size filter.
  • Live run against the real dataset (user_evidence @ size 10): 50 files probed, exactly 2 downloaded, 3 cases → 30 docs/30 queries (size-10 cases pack 10 questions per shared haystack — one ingest serves 10 queries), all ground truth non-empty.
  • Full suite green (89 tests), lint clean.
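
The seed-determinism and manifest properties tested above can be illustrated with a small sketch. This is an assumption-laden illustration, not the converter's code: the function name, the case dict layout, and the manifest keys are hypothetical, and the real sampling.json format may differ.

```python
import random
from collections import defaultdict


def stratified_sample(cases, sample_per_stratum, seed):
    """Sample up to `sample_per_stratum` cases from each (category, contextSize) stratum."""
    strata = defaultdict(list)
    for case in cases:
        strata[(case["category"], case["contextSize"])].append(case)

    rng = random.Random(seed)  # fixed seed: same input order gives the same sample
    sampled = []
    manifest = {"seed": seed, "strata": []}
    # Visit strata in sorted order so the RNG is consumed deterministically.
    for category, context_size in sorted(strata):
        population = strata[(category, context_size)]
        picked = rng.sample(population, min(sample_per_stratum, len(population)))
        sampled.extend(picked)
        manifest["strata"].append(
            {
                "category": category,
                "contextSize": context_size,
                "population": len(population),
                "sample": len(picked),
            }
        )
    return sampled, manifest
```

Recording population next to sample count is what lets a reader see how much of each stratum a reported number actually covers.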

🤖 Generated with Claude Code

ConvoMem (Salesforce, Apache-2.0, ~75K QA pairs) ships as pre-mixed
test cases — self-contained conversation haystacks plus questions —
that map 1:1 onto the grouped runner mode.

- datasets/convomem.py: the full dataset is multi-GB (one size-300
  batch is ~850MB), so fetching is selective. Batch files within each
  <category>/<N>_evidence/ dir are ordered by case context size, and an
  HTTP Range tail-probe (last 4KB) reads each file's final contextSize
  without downloading it. Only files matching the requested sizes are
  fetched; index.json records every probe result including files NOT
  downloaded, so the selection itself is auditable. Probes are
  throttled with retries (HF CDN resets rapid bursts).
- converters/convomem_to_corpus.py: stratified deterministic sampling
  by (category, contextSize) with a fixed seed; sampling.json records
  seed, per-stratum population, and sample counts so a published number
  states exactly which slice of ConvoMem it covers. Leakage scrub:
  containsEvidence/model_name dropped, conversation ids remapped to
  neutral positional ids. Ground truth maps evidence conversation ids
  through the remap; abstention evidence referencing absent
  conversations yields empty ground truth by design.
- CLI: datasets fetch --dataset convomem --context-sizes; convert
  convomem --sample-per-stratum/--seed/--context-sizes. justfile
  recipes + README section.
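
As an illustration of the leakage scrub and ground-truth remap described for converters/convomem_to_corpus.py, a minimal sketch might look like the following. The helper names, the `id` and `text` keys, and the `conv-<n>` id format are hypothetical; only the containsEvidence and model_name field names come from this PR.

```python
# Fields named in the PR as leaking the answer or the generating model.
LEAK_FIELDS = ("containsEvidence", "model_name")


def scrub_and_remap(conversations):
    """Drop leak fields and replace each conversation id with a positional one."""
    id_map = {}
    clean = []
    for position, conversation in enumerate(conversations):
        neutral_id = f"conv-{position}"  # hypothetical neutral id format
        id_map[conversation["id"]] = neutral_id
        scrubbed = {k: v for k, v in conversation.items() if k not in LEAK_FIELDS}
        scrubbed["id"] = neutral_id
        clean.append(scrubbed)
    return clean, id_map


def ground_truth(evidence_ids, id_map):
    """Map evidence conversation ids through the remap, skipping absent ones."""
    return [id_map[e] for e in evidence_ids if e in id_map]
```

With this shape, abstention evidence that points at a conversation absent from the haystack simply maps to an empty list, matching the "empty ground truth by design" behaviour described above.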

Live-verified against the real dataset (user_evidence, size 10):
50 files probed, exactly 2 downloaded, 3 cases sampled -> 30 docs /
30 queries (size-10 cases pack 10 questions per haystack, each
targeting a distinct evidence conversation), all ground truth non-empty
and remapped. 7 new unit tests; suite green.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Drew Cain <groksrc@gmail.com>
@groksrc groksrc merged commit caff4fc into main Jun 12, 2026
1 check passed
